Speed up fill_overpass_data #248

Open · wants to merge 2 commits into main

Conversation


@FlxPo commented Nov 2, 2021

PR for issue #247.
(This is my first PR ever, tell me if there's anything missing!)

@mpadge (Member) commented Nov 4, 2021

@FlxPo Thanks for the PR. It's definitely a good idea, but unfortunately it doesn't generally speed things up at all in the full context of the fill_overpass_data() function:

library (osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
doc <- "test.osm"

if (!file.exists (doc)) {
    q <- opq ("soest germany") |> # or wherever, but that's around 130,000 lines so big enough
        add_osm_feature (key="highway")
    osmdata_xml (q, filename="test.osm", quiet = FALSE)
}

f1 <- function (obj, doc) {
    doc <- xml2::read_xml (doc)
    obj <- get_metadata (obj, doc)
    doc <- as.character (doc)
    list (obj, doc)
}
f2 <- function (obj, doc) {
    nch <- file.info (doc)$size
    doc_xml <- xml2::read_xml (doc)
    obj <- get_metadata (obj, doc_xml)
    doc <- readChar (doc, nch)
    list (obj, doc)
}

obj <- osmdata ()
bench::mark (f1 = f1 (obj, doc),
             f2 = f2 (obj, doc))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f1            163ms    164ms      5.95        NA     0   
#> 2 f2            252ms    426ms      2.35        NA     2.35

Created on 2021-11-04 by the reprex package (v2.0.1.9000)

The key statistic in the table is the "median" time, which is more than twice as slow via this PR. In spite of that, I do suspect you're on to something here - the reason it is so much slower is that you're effectively reading the document twice: once via read_xml, and a second time via readChar. In your case, the read_xml is only used to extract the metadata. That suggests we should be able to re-implement extraction of the metadata directly from the readChar version, and then we'd really be getting somewhere.
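
As a rough illustration of that idea (nothing more than a sketch - the real get_metadata() extracts several fields, and the meta$timestamp slot used here is just a placeholder), the timestamp could be pulled straight from the raw string:

f3 <- function (obj, doc) {
    nch <- file.info (doc)$size
    txt <- readChar (doc, nch)
    # Overpass output carries a <meta osm_base="..."/> element near the
    # top of the document (assuming that format here), so a regex over
    # the first part of the string should be enough:
    hdr <- substr (txt, 1, 1000)
    ts <- regmatches (hdr, regexpr ('osm_base="[^"]+"', hdr))
    obj$meta$timestamp <- sub ('osm_base="([^"]+)"', "\\1", ts) # placeholder slot
    list (obj, txt)
}

That would read the file only once, at the cost of hand-rolled parsing.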

That said, the extraction of metadata is currently done entirely via XML nodes. Hand-coding a version for just the bits we need here would be possible, but could only really be done efficiently in C/C++, and even then it would verge on re-inventing the wheel. Moreover, this reading step is a relatively small portion of the total processing time for large queries, so it would be a lot of work for very little gain. For those reasons, I think this PR should just be closed. Sorry about that, especially acknowledging that it's your first PR, but rest assured that it really does reflect a good idea - just not one that is ready to be implemented here.

I'll wait to hear back from you before closing. Thanks.

@FlxPo (Author) commented Nov 22, 2021

Thanks for testing, I should have provided a benchmark!

I'm surprised by your results: I get the opposite conclusion when I run your code (slightly modified to avoid a weird "Error: Each result must equal the first result: f1 does not equal f2" error):

library (osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
doc <- "test.osm"

if (!file.exists (doc)) {
  q <- opq ("soest germany") |> # or wherever, but that's around 130,000 lines so big enough
    add_osm_feature (key="highway")
  osmdata_xml (q, filename="test.osm", quiet = FALSE)
}

f1 <- function (obj, doc) {
  doc <- xml2::read_xml (doc)
  obj <- get_metadata (obj, doc)
  doc <- as.character (doc)
  res <- list (obj, doc)
  return(TRUE)
}
f2 <- function (obj, doc) {
  nch <- file.info (doc)$size
  doc_xml <- xml2::read_xml (doc)
  obj <- get_metadata (obj, doc_xml)
  doc <- readChar (doc, nch)
  res <- list(obj, doc)
  return(TRUE)
}

obj <- osmdata ()
bench::mark (f1 = f1 (obj, doc),
             f2 = f2 (obj, doc))

#> A tibble: 2 x 13
#>  expression      min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time  gc   
#>  <bch:expr> <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <lis> <lis>
#> 1 f1            852ms   852ms      1.17  424.35KB        0     1     0      852ms <lgl ~ <Rpro~ <ben~ <tib~
#> 2 f2            135ms   143ms      7.02    4.45MB        0     4     0      570ms <lgl ~ <Rpro~ <ben~ <tib~

I'm running this with R 4.1.1 (Windows 64-bit, 6 cores, 32 GB of RAM).

I also tried f1 and f2 on a much bigger file (360 MB): f2 completes after a few seconds, but f1 runs forever. I tried to use bench::mark to get precise estimates, but finally killed it; I need the processing power for other tasks at the moment. I will try to complete the benchmark another time.
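
As an aside, bench::mark() has a check argument, so the unmodified functions could presumably be compared directly by skipping the result-equality check:

bench::mark (f1 = f1 (obj, doc),
             f2 = f2 (obj, doc),
             check = FALSE) # don't require f1 and f2 to return identical results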
